Abstract

This work presents a mathematical model that establishes an interesting
connection between nucleotide frequencies in human single-stranded DNA and
the famous Fibonacci numbers. The model relies on two assumptions: first,
that Chargaff's second parity rule is valid; and second, that the nucleotide
frequencies approach limit values when the number of bases is sufficiently
large. Under these two hypotheses, it is possible to predict the human
nucleotide frequencies with accuracy. It is noteworthy that the predicted
values are solutions of an optimization problem, a feature shared by many
natural phenomena.

1 Introduction 

The amount of available genome data is increasing very fast owing to the
completion of a host of genome sequencing projects. The careful analysis of
all these data is only beginning. The genome sequence by itself is
meaningless: it is necessary to identify genes, annotate them and, if
possible, gain some understanding of the very process responsible for the
formation of the sequence.

Less than 25% of the fly genome is in coding regions, and the number falls
to less than 3% in humans [2]. It would seem that most of a eukaryotic
genome is "garbage" DNA. Recent evidence, however, suggests that this is not
the case: mutations in noncoding regions have been associated with cancer
[7]. Consequently, interest in noncoding regions has increased, and the role
that those regions play in the whole genome demands a better understanding.

The initial step in any genome analysis is to compute some simple
statistical measures such as frequencies and averages. This kind of research
was carried out even before the discovery of the structure of DNA, and it
allowed some striking scientific advances. For instance, in 1951, Chargaff
[1] observed that, in any piece of double-stranded DNA, the frequencies of
adenine and thymine are equal, and so are the frequencies of cytosine and
guanine. In mathematical notation, Pa = Pt and Pc = Pg, where Pa, Pc, Pg and
Pt denote the nucleotide frequencies of adenine, cytosine, guanine and
thymine, respectively. This observation is known as Chargaff's first parity
rule. Watson and Crick, in 1953, were acquainted with Chargaff's first
parity rule and used it to support their DNA double-helix model [5].
Furthermore, Chargaff also observed that the parity rule holds approximately
in single-stranded DNA; the equality is no longer strict, but Pa ≈ Pt and
Pc ≈ Pg. This is known as Chargaff's second parity rule. Possibly the best
explanation of this rule can be found in [1]. Chargaff's second rule has
been extensively tested [5] and shown to hold in the majority of genome
sequences.

A particularly interesting case is found in the human genome. We have
tested Chargaff's second parity rule for each of the 24 human chromosomes
(22 + X + Y), and it is definitely valid. Moreover, notice that Pa + Pt +
Pc + Pg = 1 by definition (the sum of all frequencies must be equal to 1).
Assuming Chargaff's second parity rule, we get Pa + Pc = 1/2 or,
equivalently, Pt + Pg = 1/2, or any other combination of one nucleotide from
each pair. If we plot the points (Pa, Pc) for each human chromosome, we find
another interesting fact: they are not evenly spread over the line
Pa + Pc = 1/2, but seem to be clustered around some very precise values.
Figure 1 shows, in red, the line Pa + Pc = 1/2 and, as green dots, the
points (Pa, Pc) for each human chromosome.
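The frequency computation described above can be sketched in a few lines of Python. This is only an illustration: the function name `nucleotide_frequencies`, the decision to skip characters other than A, C, G and T (such as the ambiguity code N), and the toy sequence are assumptions, not part of the original analysis.

```python
from collections import Counter


def nucleotide_frequencies(sequence):
    """Return the relative frequency of A, C, G and T in a DNA string.

    Characters other than A, C, G, T are ignored (an assumption made
    here; the original text does not say how they were handled).
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


# Toy sequence invented for illustration only (not genome data).
toy = "AATTCGAATTCG"
freqs = nucleotide_frequencies(toy)
print(freqs)
# Under Chargaff's second parity rule, Pa + Pc should be close to 1/2.
print(freqs["A"] + freqs["C"])
```

For a real chromosome, the same function would be applied to the sequence read from the downloaded file, one strand only.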

Although this observation is unexpected, it is not completely unusual.
Many phenomena in nature show the same pattern, and some of them can be
modeled mathematically. Usually, the mathematical models that describe
natural phenomena involve optimization problems. It seems that nature is
always trying to optimize itself in different contexts. Therefore, the
following question emerges naturally: is it possible to build a mathematical
model that predicts or explains the observed frequencies?

Figure 1: In red, the line Pa + Pc = 1/2, and in green, the observed points
(Pa, Pc) for each human chromosome.

Assuming that (i) the human nucleotide frequencies really tend to limit
values when the number of bases is sufficiently large, and (ii) Chargaff's
second parity rule is valid, we derived a mathematical model that predicts
the observed frequency values with accuracy.

2 Mathematical Model 

In order to understand our model, it is necessary to introduce the Fibonacci
numbers [3].

2.1 Fibonacci Numbers 

In mathematics, one of the most famous integer sequences is without doubt
the sequence {1, 1, 2, 3, 5, 8, 13, ...}.

This sequence, called the Fibonacci sequence, is obtained through the
recurrence formula

$$F(n+2) = F(n+1) + F(n), \tag{1}$$

together with the initial conditions F(1) = 1 and F(2) = 1.

The Fibonacci sequence was first described, in the Occident, by Leonardo 
of Pisa, also known as Fibonacci, in his book Liber Abaci. The Fibonacci 
sequence appears in nature in different contexts: sea shell shapes, flower 
petals and seeds, etc. 

It is related to the Golden Ratio, $\varphi$, by the limit

$$\varphi = \lim_{n\to\infty} \frac{F(n+1)}{F(n)}. \tag{2}$$
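As a quick numerical illustration of recurrence (1) and limit (2), the following minimal Python sketch generates Fibonacci numbers and prints how fast the quotient of consecutive terms approaches the Golden Ratio. The function name `fibonacci` and the sampled values of n are illustrative choices.

```python
def fibonacci(n):
    """Return F(n) with F(1) = F(2) = 1, using recurrence (1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    a, b = 1, 1  # F(1), F(2)
    for _ in range(n - 1):
        a, b = b, a + b
    return a


golden_ratio = (1 + 5 ** 0.5) / 2

# The quotient F(n+1)/F(n) approaches the Golden Ratio as n grows.
for n in (5, 10, 20, 40):
    ratio = fibonacci(n + 1) / fibonacci(n)
    print(n, ratio, abs(ratio - golden_ratio))
```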

The Golden Ratio is associated with beauty and perfection, and for this
reason it is commonly found in art (Leonardo da Vinci), architecture (the
Parthenon in Athens, for example) and music (notably in Bartok and
Debussy). There is a large body of written work about the Golden Ratio and
the Fibonacci numbers.



2.2 Assumptions and Model 

The main assumption of our model is that Chargaff's second parity rule is
valid in all human chromosomes. There are many different ways to state it
mathematically. We have chosen the following: the quotient of the frequency
of one nucleotide by the sum of the frequencies of the remaining nucleotides
is proportional to a quotient formed from three Fibonacci numbers. Of
course, the choice of Fibonacci numbers was based on their widespread
occurrence in nature; we have also tried other sets of numbers, but with no
success. Consider the three Fibonacci numbers

$$\{F(n),\ F(n+1),\ F(n+k)\}, \tag{3}$$

where n is a sufficiently large number and k = 0, 1, 2, 3, ..., N (N is a
finite number).

Therefore, we can write the main assumption as

$$\frac{x(n)}{y(n) + z(n) + w(n)} \propto \frac{F(n+1)}{F(n+k)}, \tag{4}$$

$$\frac{y(n)}{x(n) + z(n) + w(n)} \propto \frac{F(n)}{F(n+k)}, \tag{5}$$

$$\frac{z(n)}{x(n) + y(n) + w(n)} \propto \frac{F(n+1)}{F(n+k)}, \tag{6}$$

$$\frac{w(n)}{x(n) + y(n) + z(n)} \propto \frac{F(n)}{F(n+k)}, \tag{7}$$

where x(n), y(n), z(n) and w(n) represent the nucleotide frequencies,
without any a priori association with particular nucleotides, when the
number of nucleotide bases is n, i.e., $x(n) = x_n / n$, where $x_n$ stands
for the number of occurrences of nucleotide x.

It is not straightforward to recognize Chargaff's second parity rule in
equations (4)-(7). One way to grasp the idea behind the formulas is to note
that equations (4) and (6) are proportional to the same quotient,
$F(n+1)/F(n+k)$, and the same can be said about equations (5) and (7), which
share the quotient $F(n)/F(n+k)$. In the next section, we will show how to
get Chargaff's second parity rule from the above equations.

2.2.1 Limit Values 

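Now, let us impose our second assumption, i.e., that the nucleotide
frequencies tend to limit values when n is sufficiently large.
Mathematically, it can be written as

$$\lim_{n\to\infty} x(n) = x, \tag{8}$$

$$\lim_{n\to\infty} y(n) = y, \tag{9}$$

$$\lim_{n\to\infty} z(n) = z, \tag{10}$$

$$\lim_{n\to\infty} w(n) = w. \tag{11}$$

It is also necessary to understand what happens with the quotients
$F(n+1)/F(n+k)$ and $F(n)/F(n+k)$ when n → ∞.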

Using equation (1) recursively, it is easy to get the following recurrence
formula

$$F(n+k) = F(k)\,F(n+1) + F(k-1)\,F(n). \tag{12}$$

We are particularly interested in the cases where n, the number of bases,
is large, and the quotient of the Fibonacci numbers tends to a limit.
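Identity (12) is easy to check numerically. The sketch below does so for a range of n and k, assuming the standard extension F(0) = 0 (needed for the term F(k - 1) when k = 1); the helper names are illustrative.

```python
def fibonacci(n):
    """Return F(n) with F(0) = 0 (assumed extension), F(1) = F(2) = 1."""
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n):
        a, b = b, a + b
    return a


def identity_holds(n, k):
    """Check F(n + k) == F(k) F(n + 1) + F(k - 1) F(n) for k >= 1."""
    lhs = fibonacci(n + k)
    rhs = fibonacci(k) * fibonacci(n + 1) + fibonacci(k - 1) * fibonacci(n)
    return lhs == rhs


# Exact integer arithmetic, so the comparison involves no rounding.
print(all(identity_holds(n, k) for n in range(1, 30) for k in range(1, 10)))
```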

Mathematically, this can be obtained as follows. Dividing (12) by
F(n + k), we get

$$1 = F(k)\,\frac{F(n+1)}{F(n+k)} + F(k-1)\,\frac{F(n)}{F(n+k)}. \tag{13}$$

Taking the limit as n → ∞,

$$1 = F(k) \lim_{n\to\infty} \frac{F(n+1)}{F(n+k)} + F(k-1) \lim_{n\to\infty} \frac{F(n)}{F(n+k)}. \tag{14}$$



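We define

$$\lambda_{1,k} = \lim_{n\to\infty} \frac{F(n+1)}{F(n+k)} \tag{15}$$

and

$$\lambda_{2,k} = \lim_{n\to\infty} \frac{F(n)}{F(n+k)}. \tag{16}$$

Notice that $\lambda_{1,k}$ and $\lambda_{2,k}$ are linked to the Golden
Ratio by

$$\lambda_{1,k} = \varphi^{1-k} \tag{17}$$

and

$$\lambda_{2,k} = \varphi^{-k}, \tag{18}$$

respectively.

Thus, equation (14) can be written as

$$1 = F(k)\,\lambda_{1,k} + F(k-1)\,\lambda_{2,k}. \tag{19}$$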
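The limits (15)-(18) and relation (19) can be checked numerically. The following sketch assumes F(0) = 0 and uses n = 40 as a stand-in for "n sufficiently large"; both choices, and the function names, are illustrative.

```python
golden_ratio = (1 + 5 ** 0.5) / 2


def fibonacci(n):
    """Return F(n) with F(0) = 0 (assumed extension), F(1) = F(2) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def limit_quotients(k):
    """Return (lambda_1k, lambda_2k) = (phi**(1 - k), phi**(-k))."""
    return golden_ratio ** (1 - k), golden_ratio ** (-k)


n = 40  # illustrative "large" n
for k in range(1, 8):
    lam1, lam2 = limit_quotients(k)
    finite_n_quotient = fibonacci(n + 1) / fibonacci(n + k)
    rhs_of_19 = fibonacci(k) * lam1 + fibonacci(k - 1) * lam2
    # finite_n_quotient should be close to lam1, and rhs_of_19 close to 1.
    print(k, lam1, finite_n_quotient, rhs_of_19)
```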
Finally, our model can be rewritten as

$$\frac{x}{y + z + w} = \lambda_{1,k}, \tag{20}$$

$$\frac{y}{x + z + w} = \lambda_{2,k}, \tag{21}$$

$$\frac{z}{x + y + w} = \lambda_{1,k}, \tag{22}$$

$$\frac{w}{x + y + z} = \lambda_{2,k}. \tag{23}$$

As noted before, x, y, z and w are frequencies, so we have

$$x + y + z + w = 1. \tag{24}$$
Using equation (24) in equations (20)-(23) (for example, y + z + w = 1 - x),
we get

$$\frac{x}{1 - x} = \lambda_{1,k}, \tag{25}$$

$$\frac{y}{1 - y} = \lambda_{2,k}, \tag{26}$$

$$\frac{z}{1 - z} = \lambda_{1,k}, \tag{27}$$

$$\frac{w}{1 - w} = \lambda_{2,k}. \tag{28}$$

Equations (25) and (27) imply that

$$x = z \tag{29}$$

and, analogously, equations (26) and (28) imply that

$$y = w. \tag{30}$$

Equations (29) and (30) are Chargaff's second parity rule. Moreover, an
immediate consequence of equations (29), (30) and (24) is

$$x + y = \frac{1}{2}. \tag{31}$$


2.3 Optimization Problem 

Now, we have three equations in two variables,

$$\frac{x}{1 - x} = \lambda_{1,k}, \tag{32}$$

$$\frac{y}{1 - y} = \lambda_{2,k}, \tag{33}$$

$$x + y = \frac{1}{2}, \tag{34}$$

which can be rewritten as

$$x = \frac{\lambda_{1,k}}{1 + \lambda_{1,k}}, \tag{35}$$

$$y = \frac{\lambda_{2,k}}{1 + \lambda_{2,k}}, \tag{36}$$

$$x + y = \frac{1}{2}. \tag{37}$$

This is a linear system and, using equations (19) and (31), it is not
difficult to show that it is inconsistent, independently of k. In fact, only
when k → ∞ is the system consistent, but we are dealing with the cases where
k is finite.
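The inconsistency for finite k can be seen numerically: the values of x and y fixed by (35) and (36) never add up to 1/2 for the values of k considered here. The sketch below prints the sum for k = 0, ..., 7; the function name `unconstrained_targets` is an illustrative label, not the authors' terminology.

```python
golden_ratio = (1 + 5 ** 0.5) / 2


def unconstrained_targets(k):
    """Return the right-hand sides of (35) and (36) for a given k."""
    lam1 = golden_ratio ** (1 - k)  # equation (17)
    lam2 = golden_ratio ** (-k)     # equation (18)
    return lam1 / (1 + lam1), lam2 / (1 + lam2)


for k in range(8):
    x_target, y_target = unconstrained_targets(k)
    # Equation (37) would require this sum to equal 0.5.
    print(k, round(x_target + y_target, 4))
```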

Equation (37) must be satisfied because x and y are frequencies and, by
definition, equation (24) must hold. Therefore, we should try to minimize
the difference between x and $\lambda_{1,k}/(1 + \lambda_{1,k})$, and the
difference between y and $\lambda_{2,k}/(1 + \lambda_{2,k})$, under the
condition that x + y = 1/2.

This is a classical optimization problem, and can be mathematically
stated as

$$\min_{x + y = \frac{1}{2}} f_k(x, y), \tag{38}$$

where

$$f_k(x, y) = \left(x - \frac{\lambda_{1,k}}{1 + \lambda_{1,k}}\right)^2 + \left(y - \frac{\lambda_{2,k}}{1 + \lambda_{2,k}}\right)^2. \tag{39}$$

This minimization problem is easy to solve because its objective function
is quadratic and the Jacobian of the constraint has full rank; therefore the
solution exists and is unique [6].
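One way to obtain the minimizer explicitly is the method of Lagrange multipliers. The derivation below is a sketch under the definitions above; the shorthand $a_k$, $b_k$ and the multiplier $\mu$ are notation introduced here for convenience and do not appear in the original text.

```latex
% Shorthand (introduced here): a_k and b_k are the right-hand sides of (35), (36).
\begin{align*}
a_k &= \frac{\lambda_{1,k}}{1+\lambda_{1,k}}, \qquad
b_k = \frac{\lambda_{2,k}}{1+\lambda_{2,k}}, \\
\mathcal{L}(x, y, \mu) &= (x - a_k)^2 + (y - b_k)^2
  + \mu\left(x + y - \tfrac{1}{2}\right), \\
\frac{\partial \mathcal{L}}{\partial x} &= 2(x - a_k) + \mu = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = 2(y - b_k) + \mu = 0
\;\Longrightarrow\; x - a_k = y - b_k, \\
x &= \frac{1}{4} + \frac{a_k - b_k}{2}, \qquad
y = \frac{1}{4} - \frac{a_k - b_k}{2}.
\end{align*}
```

For k = 1, $\lambda_{1,1} = 1$ and $\lambda_{2,1} = 1/\varphi$ give $a_1 = 1/2$ and $b_1 = 1/\varphi^2 \approx 0.3820$, hence $x \approx 0.3090$ and $y \approx 0.1910$.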
In Table 1 we list the solutions for the first 8 values of k. It is not
difficult to show that (x, y) → (0.25, 0.25) as k → ∞.



k   x                                      x ≈      y                                      y ≈
0   (3+√5)/(8+4√5)                         0.3090   (1+√5)/(8+4√5)                         0.1909
1   (3+√5)/(8+4√5)                         0.3090   (1+√5)/(8+4√5)                         0.1909
2   (127+57√5)/(420+188√5)                 0.3027   (83+37√5)/(420+188√5)                  0.1972
3   (161+72√5)/(550+246√5)                 0.2927   (114+51√5)/(550+246√5)                 0.2072
4   (881+394√5)/(3126+1398√5)              0.2818   (682+305√5)/(3126+1398√5)              0.2181
5   (20583+9205√5)/(75588+33804√5)         0.2723   (17211+7697√5)/(75588+33804√5)         0.2276
6   (15809+7070√5)/(59665+26683√5)         0.2649   3(9349+4181√5)/(119330+53366√5)        0.2350
7   (100793+45076√5)/(388045+173539√5)     0.2597   (186459+83387√5)/(776090+347078√5)     0.2402

Table 1: Solutions of the optimization problem for different values of k
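The decimal entries of Table 1 can be reproduced with a short Python sketch. It assumes that the minimizer of (39) on the line x + y = 1/2 is the orthogonal projection of the point given by (35) and (36) onto that line, which holds for this quadratic objective; the function name `predicted_frequencies` is illustrative. Note that the table appears to truncate rather than round to four decimals, so the last digit may differ by one.

```python
import math

golden_ratio = (1 + math.sqrt(5)) / 2


def predicted_frequencies(k):
    """Minimize (x - a)^2 + (y - b)^2 subject to x + y = 1/2."""
    lam1 = golden_ratio ** (1 - k)  # equation (17)
    lam2 = golden_ratio ** (-k)     # equation (18)
    a = lam1 / (1 + lam1)           # right-hand side of (35)
    b = lam2 / (1 + lam2)           # right-hand side of (36)
    # Orthogonal projection of (a, b) onto the line x + y = 1/2.
    x = 0.25 + (a - b) / 2
    y = 0.25 - (a - b) / 2
    return x, y


for k in range(8):
    x, y = predicted_frequencies(k)
    print(k, f"{x:.4f}", f"{y:.4f}")
```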



The values in Table 1 are in agreement with the observed frequencies in
human chromosomes. In the next section, we present the data that support
this mathematical model.



3 Results 

We have performed a simple experiment using the nucleotide frequencies in
the human genome. The human genome data were obtained from NCBI¹. New human
genome releases are deposited from time to time; we used Build 35.1. It is
important to note that only partial data are available for each chromosome,
i.e., there are still missing sections (for example, chromosome 1 is
supposed to have about 263 million bases, but only about 220 million bases
were available). This information is relevant because it can explain some
minor deviations from the predicted values.



3.1 Human Nucleotide Frequencies 

This experiment consisted of calculating the nucleotide frequencies in all
24 human chromosomes. The results are summarized in Table 2.



¹ National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov)
Chromosome   Pa       Pc       Pg       Pt       k
Chrom 1      0.2916   0.2080   0.2080   0.2922   3
Chrom 2      0.3000   0.2003   0.2005   0.2997   2
Chrom 3      0.3019   0.1980   0.1980   0.3020   2
Chrom 4      0.3093   0.1905   0.1906   0.3094   1
Chrom 5      0.3020   0.1974   0.1975   0.3011   2
Chrom 6      0.3024   0.1975   0.1976   0.3023   2
Chrom 7      0.2950   0.2040   0.2040   0.2951   3
Chrom 8      0.3002   0.2001   0.2000   0.2999   2
Chrom 9      0.2933   0.2067   0.2067   0.2931   3
Chrom 10     0.2922   0.2074   0.2074   0.2928   3
Chrom 11     0.2925   0.2072   0.2075   0.2926   3
Chrom 12     0.2950   0.2040   0.2033   0.2956   3
Chrom 13     0.3072   0.1922   0.1922   0.3080   1
Chrom 14     0.2951   0.2034   0.2039   0.2974   3
Chrom 15     0.2903   0.2101   0.2099   0.2895   3
Chrom 16     0.2750   0.2040   0.2040   0.2750   4
Chrom 17     0.2740   0.2261   0.2258   0.2713   5
Chrom 18     0.3014   0.1982   0.1985   0.3017   2
Chrom 19     0.2588   0.2403   0.2409   0.2598   7
Chrom 20     0.2785   0.2194   0.2202   0.2817   5
Chrom 21     0.2940   0.2040   0.2039   0.2952   3
Chrom 22     0.2605   0.2398   0.2397   0.2598   6
Chrom X      0.3027   0.1968   0.1967   0.3033   2
Chrom Y      0.3098   0.1893   0.1889   0.3118   1

Table 2: Nucleotide frequencies for all human chromosomes



The nucleotide frequencies are clustered around the predicted values of
Table 1. Figure 2 shows, in red, the solutions of the optimization problem
for different values of k and, in green, the points (Pa, Pc) for each of the
human chromosomes. The red circles are centered at the solutions of the
optimization problem and all have the same radius, equal to 0.005.

It is interesting to note that all the points (Pa, Pc) are near (within
0.005 of) the predicted values.

The average values are μ(Pa) = 0.292, μ(Pc) = 0.207, μ(Pg) = 0.207 and
μ(Pt) = 0.292, which are close to the solution of the optimization problem
for k = 3.
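The comparison between observed and predicted points can be sketched as a nearest-point search. The code below is an illustration only: it assumes the closed-form minimizer of (39) on the line x + y = 1/2, restricts the search to k = 1, ..., 7 (k = 0 gives the same point as k = 1), and uses three (Pa, Pc) pairs copied from Table 2. The function names are hypothetical.

```python
import math

golden_ratio = (1 + math.sqrt(5)) / 2


def predicted_point(k):
    """Return the predicted (x, y) for a given k (projection onto x + y = 1/2)."""
    lam1 = golden_ratio ** (1 - k)
    lam2 = golden_ratio ** (-k)
    a = lam1 / (1 + lam1)
    b = lam2 / (1 + lam2)
    return 0.25 + (a - b) / 2, 0.25 - (a - b) / 2


def nearest_k(pa, pc, k_values=range(1, 8)):
    """Return (k, distance) of the predicted point closest to (pa, pc)."""
    def distance(k):
        x, y = predicted_point(k)
        return math.hypot(pa - x, pc - y)

    best = min(k_values, key=distance)
    return best, distance(best)


# (Pa, Pc) pairs copied from Table 2 for chromosomes 1, 4 and 19.
observed = {"1": (0.2916, 0.2080), "4": (0.3093, 0.1905), "19": (0.2588, 0.2403)}
for name, (pa, pc) in observed.items():
    k, dist = nearest_k(pa, pc)
    print(name, k, round(dist, 4))
```

For these three chromosomes the nearest k agrees with the k column of Table 2, and each distance is below the radius 0.005 used in Figure 2.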

4 Conclusion 

Using Chargaff's second parity rule and assuming that the nucleotide
frequencies tend to limit values when the number of nucleotide bases is
sufficiently large, we have described a mathematical model that predicts the
limit values of the human nucleotide frequencies with great accuracy. It is
also interesting to note that the limit values are the solutions of an
optimization problem, as is commonly found in many natural phenomena.

If our two hypotheses hold and our mathematical model is correct, then
it is possible to make the following conjecture: the noncoding DNA regions
play a major role in the "optimization process" that reaches the limit
values predicted by our mathematical model. This conjecture is based on the
fact that about 97% of the human genome is believed to be noncoding.

Figure 2: In red, the solutions of the optimization problem for different
values of k. In green, the points (Pa, Pc) for each of the human
chromosomes.